Normalizing Medieval German Texts: from rules to deep learning
Author
Abstract
Applying NLP tools to historical texts is complicated by a high level of spelling variation, and several methods of historical text normalization have been proposed. In this comparative evaluation I test three approaches to text canonicalization on historical German texts from the 15th and 16th centuries: rule-based, statistical machine translation, and neural machine translation. Character-based neural machine translation, which had not previously been tested for the task of normalization, showed the best results.
Similar resources
Computing distance and relatedness of medieval text variants from German
In this paper, we explore several ways of computing similarity between medieval text variants from German. In comparing these texts, we apply methods from word and sentence alignment and compute cosine similarity based on character and part-of-speech n-grams. The resulting similarity (or distance) scores are visualized by phylogenetic trees; the methods correctly reproduce the well-known dist...
Predicting the Past: Memory Based Copyist and Author Discrimination in Medieval Epics
In this paper we will focus on the scribal variation in manually copied medieval texts. Using a lazy machine learning technique, we will argue that it is possible to discriminate between scribes, implying that they did adapt texts when copying them. Consequently, we will assess to what extent scribal interventions compromise our ability to detect the original authorship of medieval texts. It wi...
Classification of Chest Radiology Images in Order to Identify Patients with COVID-19 Using Deep Learning Techniques
Background and Aim: Given the important role of radiological images in identifying patients with COVID-19, the main objective of this study was to create a model based on deep learning methods. Materials and Methods: 15,153 chest images of normal, COVID-19, and pneumonia individuals, available in the Kaggle data repository, were used as the dataset for this research. Data preprocessing inc...
BioinformaticsUA: Machine Learning and Rule-Based Recognition of Disorders and Clinical Attributes from Patient Notes
Natural language processing and text analysis methods offer the potential of uncovering hidden associations from large amounts of unprocessed texts. The SemEval-2015 Analysis of Clinical Text task aimed at fostering research on the application of these methods in the clinical domain. The proposed task consisted of disorder identification with normalization to SNOMED-CT concepts, and disorder at...
Improving historical spelling normalization with bi-directional LSTMs and multi-task learning
Natural-language processing of historical documents is complicated by the abundance of variant spellings and lack of annotated data. A common approach is to normalize the spelling of historical words to modern forms. We explore the suitability of a deep neural network architecture for this task, particularly a deep bi-LSTM network applied on a character level. Our model compares well to previou...
Journal title:
Volume / Issue
Pages -
Publication date: 2017